NSF PAR Search | NSF Public Access Repository


Learning sparse log-ratios for high-throughput sequencing data

https://doi.org/10.1093/bioinformatics/btab645

Gordon-Rodriguez, Elliott; Quinn, Thomas P; Cunningham, John P (December 2021, Bioinformatics)
Luigi Martelli, Pier (Ed.)
Abstract
Motivation: The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets.
Results: Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods.
Availability and implementation: The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore.
Supplementary information: Supplementary data are available at Bioinformatics online.
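Illustration (not part of the record): the abstract above refers to log-ratios between input variables without giving a formula. As a minimal sketch, assuming one common form, the log of the ratio of the geometric means of two groups of variables, the Python below shows what such a biomarker computes and why it suits compositional data. The function name `log_ratio` and the sample values are hypothetical, and this is not the CoDaCoRe implementation or its API.

```python
import math


def log_ratio(x, numerator, denominator):
    """Log-ratio between the geometric means of two groups of parts of x.

    x: sequence of strictly positive values (e.g. counts or abundances).
    numerator, denominator: index lists selecting the two groups.
    Assumed form for illustration only; not taken from the CoDaCoRe package.
    """
    selected = list(numerator) + list(denominator)
    if any(x[i] <= 0 for i in selected):
        raise ValueError("log-ratios need strictly positive parts")
    # The log of a geometric mean equals the arithmetic mean of the logs.
    mean_log_num = sum(math.log(x[i]) for i in numerator) / len(numerator)
    mean_log_den = sum(math.log(x[i]) for i in denominator) / len(denominator)
    return mean_log_num - mean_log_den


# Hypothetical sample with four input variables.
sample = [10.0, 40.0, 5.0, 20.0]
# The same sample at three times the sequencing depth.
scaled = [v * 3.0 for v in sample]

ratio = log_ratio(sample, numerator=[1], denominator=[0])
ratio_scaled = log_ratio(scaled, numerator=[1], denominator=[0])
```

Rescaling every variable by the same factor leaves the value unchanged, so the log-ratio depends only on relative abundances. Choosing which indices go into the numerator and denominator is the combinatorial selection problem the abstract describes.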
Optimizing viral genome subsampling by genetic diversity and temporal distribution (TARDiS) for phylogenetics

https://doi.org/10.1093/bioinformatics/btab725

Marini, Simone; Mavian, Carla; Riva, Alberto; Prosperi, Mattia; Salemi, Marco; Rife Magalis, Brittany (October 2021, Bioinformatics)
Luigi Martelli, Pier (Ed.)
Abstract
Summary: TARDiS is a novel phylogenetic tool for optimal genetic subsampling. It optimizes both genetic diversity and temporal distribution through a genetic algorithm.
Availability and implementation: TARDiS, along with example datasets and a user manual, is available at https://github.com/smarini/tardis-phylogenetics
Discovering a sparse set of pairwise discriminating features in high-dimensional data

https://doi.org/10.1093/bioinformatics/btaa690

Melton, Samuel; Ramanathan, Sharad (July 2020, Bioinformatics)
Luigi Martelli, Pier (Ed.)
Abstract
Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions.
Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace.
Availability and implementation: https://github.com/smelton/SMD.
Supplementary information: Supplementary data are available at Bioinformatics online.
HAPPI GWAS: Holistic Analysis with Pre- and Post-Integration GWAS

https://doi.org/10.1093/bioinformatics/btaa589

Slaten, Marianne L; Chan, Yen On; Shrestha, Vivek; Lipka, Alexander E; Angelovici, Ruthie (June 2020, Bioinformatics)
Luigi Martelli, Pier (Ed.)
Abstract
Motivation: Advanced publicly available sequencing data from large populations have enabled informative genome-wide association studies (GWAS) that associate SNPs with phenotypic traits of interest. Many publicly available tools able to perform GWAS have been developed in response to increased demand. However, these tools lack a comprehensive pipeline that includes pre-GWAS analysis, such as outlier removal, data transformation and calculation of Best Linear Unbiased Predictions or Best Linear Unbiased Estimates. In addition, post-GWAS analysis, such as haploblock analysis and candidate gene identification, is lacking.
Results: Here, we present Holistic Analysis with Pre- and Post-Integration (HAPPI) GWAS, an open-source GWAS tool able to perform pre-GWAS, GWAS and post-GWAS analysis in an automated pipeline using the command-line interface.
Availability and implementation: HAPPI GWAS is written in R for any Unix-like operating system and is available on GitHub (https://github.com/Angelovici-Lab/HAPPI.GWAS.git).
Supplementary information: Supplementary data are available at Bioinformatics online.